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Abstract 

We propose a method that uses genetic data to test for the occurrence of a recent range expansion and to 
infer the location of the origin of the expansion. We introduce a statistic for pairs of populations ip (the 
directionality index) that detects asymmetries in the two-dimensional allele frequency spectrum caused 
by the series of founder events that happen during an expansion. Such asymmetry arises because low 
frequency alleles tend to be lost during founder events, thus creating clines in the frequencies of surviving 
low-frequency alleles. Using simulations, we further show that ip is more powerful for detecting range 
expansions than both Fst and clines in heterozygosity. We illustrate the utility of tp by applying it to a 
data set from modern humans and show how we can include more complicated scenarios such as multiple 
expansion origins or barriers to migration in the model. 

Author Summary 

Many important biogeographic processes can be interpreted as range expansions, where a species is 
constraint to a small local habitat in the beginning and expands from there to colonize a larger region. 
Examples of this include biological invasions, the spread of infectious diseases and humans colonizing the 
world. We present a statistical framework to test for the existence of such an expansion and examine 
its properties. We then use this framework to make further inference, and show how it can be used to 
estimate the origins of the range expansion and to identify barriers of gene flow. 

Introduction 

Range expansions are ubiquitous in natural populations, and they are responsible for numerous biolog- 
ical phenomena. Range expansions result in a series of founder events that cause the newly founded 
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26 populations to differ genetically from the source population. Some well-known examples are biological 

27 invasions pQ, the post-ice age patterns of migration in several European [2JI3], an d the colonization of 
2s Eurasia and North and South America by modern humans [3HS]- I n some cases descendants from the 

29 source population remain near the location of the ancestral population. For example, the European 

30 population of the brown bear Ursus arctos most likely survived the last ice age in refugia in Spain and 

31 Greece. Brown bears followed the receeding glaciers to colonize most of Europe, but populations at the 

32 locations of the former refugia persisted until the populations were driven to the verge of extinction by 

33 humans in the 20th century [7]. Another example are humans, where derived populations are found all 

34 over the world, but there are also descendants of the first humans still living in Africa. 

35 Sometimes, the routes of migration are known from direct observations, historical records and archae- 

36 ological evidence. Frequently, however, the exact history of a species is unknown, and we want to use 

37 population genetic methods to gain more information. In this paper, we use genetic data to address two 

38 related problems: detecting whether a range expansion has occurred and inferring the geographic origin 

39 of a range expansion. 

40 Characterizing the influence of geographic structure on genetic diversity has been one of the major 

41 goals of population genetics theory, with important contributions from Wright [8] , Malecot [3] , Kimura [TU] 

42 and many others. While there are many statistics designed to infer differentiation between populations 

43 [11H14] , the most widely used statistic to detect differentiation between populations is the fixation index 

44 Fst, which traces to Wright [15]. A variety of estimators of Fst have been developed (e.g [T51[TB] ). 

45 Roughly speaking, Fst measures how much diversity exists between subpopulations compared to the 

46 diversity in the entire population; a value of indicates that the two subpopulations are indistinguishable, 

47 whereas a value of 1 indicates that two populations are maximally differentiated. Fst h as been directly 

48 linked to the migration rate in several models, including the finite island |17| and stepping-stone models 

49 [18j . Although Fst can be used to quite accurately estimate the amount of gene flow between equilibrium 

50 populations, it cannot be used to infer directionality of gene flow. 

51 Two other methods that are widely used to detect geographic patterns are clustering algorithms 

52 and ordination methods. Clustering analyses [T!?H2"2"] such as STRUCTURE [TH]) classify individuals into 

53 discrete groups, which can then be used for further analysis. Ordination techniques |23j . such as principal 

54 components analysis and multidimensional scaling summarize data by indicating the overall similarity 

55 of populations. Principal component analysis has been shown that genetic diversity relatively closely 



3 



56 correlates to the geographic distribution of humans on a continental |2I] and global [ISJdS] scale. 

57 It is also possible to use likelihood methods to infer past features of population history. For example, 
5s the program IM |27j estimates the time of separation of populations and migration rates between them 

59 using data from multiple unlinked loci, and the program dadi [28j estimates past rates of population 

60 growth from the joint allele frequency spectrum from two or three populations. Both of these programs 

61 arc computationally intensive and neither can analyze data from numerous populations. 

62 Most statistics applied to subdivided populations do not provide information about asymmetries. 

63 Fst and most genetic distances are defined in such a way that they are commutative (i.e. Fst between 

64 populations A and B is the same as Fst between B and A), and hence the value depends only on the 

65 amount of migration, not whether migrants moved mostly from A to B or from B to A. Clustering 

66 algorithms can produce groupings of populations that can be interpreted as describing an expansion, but 

67 the expansion-specific information is lost in the process and the results of clustering is often sensitive 
6s to tuning parameters such as the number of clusters. For principal components analysis, the view that 

69 the first principal component axis follows the direction of expansion |29j has recently been challenged 

70 and it has recently been shown that, depending on parameters and the locations of the 

71 populations sampled, the first principal component axis might be parallel to or orthogonal to the axis of 

72 expansion, or at an angle in between. 

73 Population genetics theory has shown that a range expansion can be detected from the characteristic 

74 reduction in genetic diversity with increasing distance from the origin of the expansion [5l l3lH3"5l . The 

75 reason is that the succession of founder events during the expansion cause the progressive loss of genetic 

76 variants. This prediction has been confirmed by comparing the numbers of mtDNA haplotypes found in 

77 Southern European rcfugia and in central Europe [7J. The same pattern can also been seen in humans 

78 where both a reduction in heterozygosity and an increase in linkage disequilibrium with increasing distance 

79 from the presumed origin of the expansion in Africa can be shown [5] . 

ao In addition to creating a gradient in genetic diversity, range expansions tend to create clines in the 

si frequencies of neutral alleles, with the frequency increasing on average in the direction of the expansion 

82 [35) . An intuitive reason for this pattern is that each founder event results in additional genetic drift, 

83 and populations further away from the origin of expansion will therefore have experienced more genetic 

84 drift. This can be seen from the following simple argument: The expected frequency of an neutral allele 
as in the new population is the same as in the source population. But some alleles will have zero frequency 
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86 in the new population. Therefore, the average frequency of shared alleles, i.e. alleles that are present in 

87 the new population, is expected to be higher than in the source population, thus creating the cline. This 
as observation provides the foundation for our method of detecting range expansions. 

89 In this paper, we introduce a statistic, the directionality index ip, for pairs of populations, ip is 

go sensitive to patterns created by range expansions as it detects clinal patterns created by successive range 

91 expansions. We show, using simulations, that the expectation of ip is zero in an equilibrium isolation-by- 

92 distance model, and that positive values of ip indicate the direction of the expansion. We also show that 

93 if we have multiple samples, ip can be used to infer the origin of a range expansion and the location of 

94 barriers of gene flow. We also explore the power and robustness of our methods and finally apply it to 

95 human genetic data. 

96 Results 

97 In this section, we will define the directionality index, give an intuitive explanation and discuss some of 

98 its properties. We will show that the directionality index is sensitive to recent range expansions in a one 

99 or two dimensional stepping-stone model, and then explore some more advanced applications. 

100 Definition Of The Directionality Index 

101 Consider two samples of size n,n > 2 taken from two subpopulations S\,S2- Each sample consists of L 

102 biallelic markers (e.g. SNPs) that are shared between S\ and 5 I 2- The directionality index is defined as 

1 - - 1 L 

ip = - (/si - fs 2 ) = -j— Y (feui - fs 2 ,i) , (!) 

103 where fs is the average allele frequency of all alleles in population S, and fg i is the number of derived 

104 copies of allele I in the sample from population S. Equivalcntly, ip can also be defined in terms of the 

105 two-dimensional site frequency spectrum (2D-SFS): 

n n 

>■ Y^Yj' /••/., (2) 

»=1 3=1 
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ice where fij denotes the proportion of SNP in the sample that are at frequncy i in Si and at frequency j 

107 in S*2, and the SFS is normalized such that X)j<j fij = 

V = /ai - /i2- (3) 

los The three different definitions represent different interpretations of the directionality index, and it is 

log useful in building an intuition to discuss them briefly: Equation [1] corresponds to the definition alluded 

no to in the introduction, where we compare the average allele frequency between the two populations. As 

m the population further away from the expansion origin is expected to have experienced more genetic drift, 

n2 its alleles are expected to be at a higher frequency and thus ip is positive if fs ± > fs 2 an d Si is further 

u3 away from the origin of the expansion. If both populations have experienced similar amounts of genetic 

u4 drift, then the allel frequencies will be equal, ip Rj and we will not detect an expansion. Equation 

us [5] is based on the SFS, and we see that ip will be positive if fa is usually greater than fa. Thus, we 

lie arc comparing the SFS entries that are mirrored along the x = y diagonal, and the directionality index 

H7 measures the "skew" in the 2D-SFS. If there are more SNP that fall in the upper left triangle of the 

us SFS (where j > i), tJj will be negative, and we infer an expansion from Si to #2- The inverse conclusion 

us will be drawn if there is an excess of SNP in the lower right triangle, and if the SNP are distributed 

120 symmetrically around the x — y diagonal, ip will be zero. Much of the paper will be focused on the case 

121 where each population is represented by a single genome, a case we think will be particularly common 

122 in many analyses. In this case, equation [2] reduces to [3] and we are simply comparing the abundance of 

123 SNP that are fixed for the derived allele in sample Si and heterozygous in 52 to the number of SNP 

124 that are heterozygous in Si and fixed in S^- If cither number is significantly larger than the other, we 

125 infer migration in the direction of the larger number. It is also worth noting that the computation cost 

126 of equation [T] scales proportionally to the number of loci in the sample, whereas equation [2] only depends 

127 on the sample size squared. Thus, for genome scale data sets where L >> n 2 , ([2]) will be much faster to 

128 compute. 

129 Determining Whether A Range Expansion Occurred 

130 We first test the power of ip to distinguish pairs of populations sampled from a recent range expansion 

131 to pairs of populations sampled under isolation-by-distance at equilibrium in a ID-model. Figure [T] 
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132 shows that Fst increases at approximately the same rate under an equilibrium stepping-stone model 

133 with only isolation- by-distance (Panel A) and a model with a range expansion (Panel B), indicating that 

134 the two scenarios are comparable. We see that ip is constant at zero in the isolation-by-distancc model, 

135 regardless of the distance between the samples. In contrast, ip increases with distance under the expansion 

136 model, due to the increase in allele frequency along the expansion axis. Interestingly, ip increases almost 
13? linearly with the distances between the origin and the population sampled, a fact we exploit in the 
us next section to infer the origin of the expansion. We also plotted the heterozygosity, a statistic that is 
139 also expected to be constant under an equilibrium model |36| and increasing under an expansion [5ll31j. 
wo However, our simulations show that heterozygosity is larger in the center of the habitat than near the 

141 boundaries because of the boundary effects. This is in contrast to most theoretical results [36] which either 

142 assume either a circular model or an infinitely long stepping stone model, and where the heterozygosity 

143 is independent of the sampled deme. However, the observed gradient in heterozygosity has been observed 

144 previously and explained by longer coalescence times for a sample taken close to the boundary |37U38j 

145 and it is worth noting that this effect is much weaker in a two dimensional population. 

we On the other hand, Fst and ip behave in similar ways in both ID and 2D models (Figure [2]) ■ Fst 

147 is slightly larger in the case of a range expansion than in the isolation-by-distance model (Panels A and 

us C), but qualitatively we see an increase of Fst with distance under either model. The pattern for ip, 

149 however, is again different (Panels B and D): under the isolation-by-distance model, ip is smaller than 0.01 

150 for almost all comparison, with the exception of a few demes that are at the boundary of the simulated 

151 region. In contrast, the magnitude and sign of ip nicely illustrate the effect of the range expansion, ip 

152 is zero only for demes that are very close to each other or demes that are equally far away from the 

153 expansion origin. The latter can be explained by symmetry: two samples that arc an equal distance 

154 apart from the origin will have a symmetric SFS, resulting in a ip close to zero. 

155 Various significance tests can be used to determine the significance of ip between two populations; for 

156 the case of n = 2 in both samples we can simply perform a binomial test on the absolute frequencies /21 

157 and fi2- If their proportions differ significantly from 0.5, we can reject the null hypothesis of symmetric 

158 migration between the two demes. When comparing samples of size n > 2, we can generate a null 

159 distribution using a permutation test, i.e randomly assigning the allele frequencies for each SNP to cither 

160 population. However, both these tests will underestimate the variance in the data if SNPs are not in 

161 linkage equilibrium. In that case the "effective" number of loci will be lower than the actual number. To 
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«2 take linkage into account we use a computationally more intensive block-jackknifc approach [391140] to 

163 analyse our real data. 

164 In Figure [3J we show the effect of the most important parameters on our ability to reject the null 

165 hypothesis of isolation-by-distance for pairs of samples of size two. For all parameters, we find that using 

166 the directionality index results in higher power than comparing differences in heterozygosity, while false- 

167 positive rates are low and roughly the same for the two methods. We find that we have comparatively 
lea little power to reject the null hypothesis if the two sampled individuals are close to each other(Panel 

169 |3jA-). This is expected, since there are fewer founder events separating the two individuals. Therefore 

170 we expect ip to be lower for nearby populations, as shown in Figures [T] and [2] Panel B shows that 

171 a moderate number of shared SNPs is necessary, i.e. more than one thousand, to get high power to 

172 reject equilibrium isolation-by-distance. In addition, we find that slow expansions are harder to detect 

173 than rapid expansions, and more recent expansions are easier to detect than expansions that happened 

174 a long time in the past (Panels C and D). Neither of these findings is unexpected; after an expansion 

175 genetic drift will affect the loci in both populations equally The number of shared SNP that are due 

176 to the range expansion will decrease with time and be partially replace by SNP that only experienced 

177 the equilibrium model and hence do not carry a signal of the expansion. Similarly, if the time between 

178 expansion events is high, the founder effects caused by the expansion will become less important relative 

179 to genetic drift between expansion events, weakening the signal of the expansion. In this scenario, the 
lao power of heterozygosity to detect an expansion decays much faster than the power of ip. Finally, we note 

181 that the false positive rate, denoted in grey and pink in Figure [HJ is independent of both the distance 

182 between loci and the number of SNPs. 

183 Inferring The Origin Of A Range Expansion 

184 In addition to showing that range expansion occurred, the results in Figures 1 and 2 suggest that spatial 
las patterns in pairwise values of ip can indicate the origin of an expansion if we have more than two 
186 samples. For this purpose, we employ a method commonly used by engineers in problems of localization 
is? and navigation [5T], called Time Difference of Arrival location estimation (TDOA). TDOA methods are 
las used in remote sensing and to locate cell phones [5T]. The key assumption of the TDOA algorithm is 

189 that the magnitude of a pairwise statistic between two sample locations i and j is proportional to the 

190 difference in distance from i to the origin and the distance from j to the origin. So, if i is very close 
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191 to the origin and j far away, then we would expect the TDOA statistic to be large, but if i and j arc 

192 at the roughly the same distance from the origin, then the TDOA statistic should be close to zero. In 

193 engineering applications the TDOA statistic is the time difference between the arrival of a signal emitted 

194 from the origin (hence the name). In our application, however, tp takes on the role of the time difference 

195 with the implicit assumption that the directionality index between locations i and j, ipij, is proportional 

196 to the difference in distances from i and j to the origin, respectively. To further illustrate how we can 

197 use this method to infer an origin, we first consider the special case of ipij = 0- Assuming that we have 
us already rejected isolation-by-distance, we know that i and j are equally far from the origin and the origin 

199 must therefore lie on the line perpendicular to the line through i and j. 

200 If we had three or more loci all at the same distance from the origin (so that the pairwise ip values 

201 are all zero), we could thus infer the origin as the center of the circle passing through the three points. 

202 In general, however, ip will be non-zero. In that case, we know from elementary geometry that the set of 

203 candidate points based on a one pair of samples is not a straight line, but a hyperbola with i and j as its 

204 focal points. If we have samples from k locations, we can calculate ip for k(k — l)/2 samples and hence 

205 obtain k * (k — l)/2 hyperbolas. In a perfect, noiseless world, all hyperbolas would intersect in a single 

206 point: the origin of the expansion, as illustrated in Figure 01 In practice, of course, genetic data is highly 

207 stochastic and we have to estimate the origin. To do this, we interpret each hyperbola as a non-linear 

208 equation with three unknowns, the sample coordinates x, y and the speed of expansion v. v is a nuisance 

209 parameter that describes how much the allele frequency increases per unit distance from the origin. For 

210 more than three samples the system is overdetermined and, rather than solving the system of equations 
2n explicitly, we use weighted non-linear least squares. 

212 We first illustrate this approach on simulated data, where we sample a regular grid (Figure [5] We 

213 simulated a range expansion in a 101 x 101 stepping stone model. In all simulations, we chose the 

214 coordinate system such that each dome corresponds to one unit of distance. The start of the expansion 

215 is in deme (25,35), indicated by the grey dotted lines in Figure [5J The direction of the arrows plotted in 

216 Figure [5] indicate the sign of the pairwise ip- value, between adjacent samples on a grid, and the thickness 

217 of each arrow corresponds to the magnitude of ip. A missing arrow denotes a non-significant ip value. 

218 In Panel wc performed a simulation under an equilibrium isolation-by-distance model. We see that 

219 in this scenario, only 11 out of the 60 pairwise comparisons are significant; all of them point towards 

220 the corners and are due to the boundary effects of the simulations. The red ellipse is a 95% confidence 
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221 ellipse of the inferred origin. Under the isolation-by-distance model, this is located in the center of the 

222 population, illustrating that the TDOA approach will yield an answer even if there is no expansion has 

223 occured, so it is important to first test if an expansion has actually occurcd. From Panels B-D we see 

224 that the expansion signal is clearly portrayed by the directionality indeces and we get high confidence in 

225 the estimated origin. In fact, the confidence is so high that the ellipse is barely visible in Panel B, but 

226 confidence decreases when we reduce the number of samples. Furthermore, the see from Panels C and D 

227 that the origin is slightly biased towards the center of the population. This is again due to a boundary 
22s effect, and goes away if we take all samples at least 10 demes away from the boundary of the population. 

229 To assess the properties of this method more systematically, we report root mean squared error 

230 (RMSE) under several scenarios (Figurc[6]). We also compare our method to the method of Ramachandran 

231 et al. [5], who used linear regression of the heterozygosity on the distance to candidate origins. Their 

232 inferred origin of the expansion is the point with the highest associated regression coefficient, conditional 

233 on the slope of the regression curve being negative. Most data in Figure [5] was simulated with a fairly 

234 rapid expansion; the time between subsequent expansion events was set to 0.001 coalescence units, so 

235 that the complete expansion was completed in 0.13 coalescence units. This speed is roughly equal to the 

236 out-of- Africa expansion of humans. For these parameters (Figure [S^-D) the two methods have similar 

237 performance, with only marginal improvements in how the methods perform with different amounts of 
23s data. We find that with adequate numbers of samples and data, the RMSE for both method is around 

239 four, with less than one distance unit of difference between the two methods. Overall, the ideal amount of 

240 data for this method lies around 20 diploid samples and 7,000 independent SNP. Increasing the amount 

241 of data beyond this level will not substantially improve performance. For the set of simulations with 

242 increasing numbers of SNP, we also tested the effects of sampling on a grid versus taking samples from 

243 random locations. It might be assumed that the latter scenario is closer to a realistic sampling scheme. 

244 Interestingly, we found only negligible differences, indicating that the sampling locations are only a minor 

245 issue unless the sampling locations are very skewed (e.g. a transsect sample). 

246 Changing the position of the origin has little effect on the RMSE for the first 30 distance units, 

247 indicating that we have good accuracy if the origin is not close to the border. If there is an expansion 

248 that started outside the sampled range, the method will perform significantly worse. This has two causes: 

249 first, we would expect it to be easier to infer the origin if it lies in the middle of the sample, as compared 

250 to an origin that is far away from all samples. This part also explains the difference between samples 
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251 taken on a grid and random samples: In the grid, the corners are systematically sampled (since we force 

252 a grid sample to be there), whereas in many random samples there may be fewer samples on one side of 

253 the origin than on the other, resulting in a loss of accuracy. A second factor resulting in reduced accuracy 

254 are again boundary effects, which skew the effect of the expansion if it happened close to the boundary. 

255 We focus our attention on the effect of the parameters of the expansion ((BJC-F): The number of 

256 founders (Figure |6jl) has an almost linear effect on the estimation accuracy. Fewer founders imply a 

257 stronger founder effect and hence a stronger signal of expansion [35], which makes the origin easier to 

258 detect. We find the biggest difference in how our method performs in comparison to the Ramachandran 

259 method when slowing down the expansion, or when we want to detect an expansion that occured some 

260 time in the past. Interestingly, our method performs at almost the same accuracy independent of the 

261 expansion speed, whereas the accuracy of the Ramachandran method declines faster. Also, we find that 

262 the heterozygosities approach equilibrium soon after the expansion has finished ((BJ 7 ), whereas the shared 

263 alleles used by ip keep the signature of the range expansion for much longer. 

264 Adding Environmental Complexity 

265 The previous section assumes an idealized population in a homogeneous habitat. In practice, however, 

266 habitats are heterogeneous and barriers to gene flow and pathways of expansion are often very important. 

267 In the following sections, we show how our method performs in slightly more complex scenarios. First, 

268 we allow denies with different population sizes. While we kept the mean size of demes the same, we 

269 followed |42j in drawing deme sizes from a gamma distribution. A second important feature not present 

270 in the previous simulations are barriers to dispersal that affect both the initial expansion and gene flow 

271 following the expansion. Depending on the species, these barriers may correspond to rivers, mountains 

272 or even roads. We illustrate how we can use graph algorithms and the directionality index to identify 

273 them. Finally, we explore the case with an expansion starting from multiple points, and how we can infer 

274 the coordinates of the origins. 

275 Heterogeneous Population Sizes 

276 The effect of variance in deme size on demographic expansions was explored extensively in a simulation 

277 study by Wegmann et al. [H] . They found that heterogeneous populations have a higher rate of population 

278 differentiation between demes, and predicted that detecting range expansion would be more difficult 
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because of the increased noise. Our simulations confirmed this prediction but only if there is substantial 
variation in deme size. We found that heterogeneity in deme size has little effect if the variance in 
dome size is low, with RMSE only differing slightly from the case with equal deme sizes. A variance 
of 0.5 in deme size, for example, corresponds to a size difference of around two orders of magnitude 
between the largest and smallest deme. But the average RMSE for the location estimate only increased 
to 5.43, compared to 4.57 in a comparable scenario without variation in deme size. However, this value 
corresponds to some kind of "tipping point" : when we further increasing the variance in deme size, 
some deme sizes will become effectively zero in size and this greatly reduces the accuracy of the location 
estimate, indicating that significant barriers to gene flow make our location estimate less precise. 



We can use pairwise directionality indices to gain information about colonization paths, i.e. the corridors 
through which the population expanded. We approach this problem by interpreting a set of pairwise 
directionality indices as a directed graph, where the samples correspond to the vertices and the direc- 
tionality indices correspond to the edges connecting the samples. To achieve a visual representation of 
the graph, we apply graph algorithms to remove some of the edges. In particular, we use the transitive 
reduction algorithm [33]. A transitive reduction finds the graph with the fewest edges that retains the 
connectivity of the original graph. That is, if there is an edge from vertex v\ to vertex V2, but there is 
also an indirect path from v\ to v-i via another vertex V3 (i.e. psi is significantly negative between V\ 
and V3 as well as between V3 and 1*2), then the edge from v\ to 1)2 is removed from the graph. A further 
reduction can be obtained by computing a maximum spanning tree |44j , which reduces the graph to n — 1 
edges, where n is the number of samples. The maximum spanning tree representation should be able to 
identify major migration paths, and does not cross strong barriers of gene flow (Figure [7]). Furthermore, 
we can obtain an ordering of all samples by simply summing all ip values that sample is involved in: 



The lowest value of tpi among the samples denotes the sample that was taken closest to the origin, and 
the highest value of ipi is the sample furthest along the expansion. In Figure 03 we show that both the 
maximum spanning tree and the ordering are useful tools and able to identify the barriers. 



Barriers 




(4) 



j'Gsamples 
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305 Multiple Origins 

306 Range expansions may have more than one source. A classical example is the colonization of Central 

307 Europe after the last glacial maximum. Species with Southern European refugia in the Balkan Penisula, 
3os Italy or the Iberian peninsula followed the receding glaciers and explain many biogeographical pattern we 

309 observe today [3]. Our method extends in a straightforward manner to such expansions. The key idea is to 

310 first find samples that were predominantly colonized from a single origin, and then estimate the position of 
3n that origin independently. There are several ways to assign sampled individuals to clusters corresponding 

312 to a single origin. In classical studies, often mtDNA haplotypes were used for this purpose (e.g. [2K]), but 

313 programs such as STRUCTURE [19| or simple clustering based on the observed polymorphism frequencies 

314 may yield more accurate results. In our simulations, a simple K-means clustering algorithm was able to 

315 correctly identify the number of clusters in all cases, even when the two founder populations were drawn 

316 from the same original population. The resulting estimates of the origins are slightly less precise than 

317 with a single origin (Figure [8]) , but that is to be expected as there are fewer samples contributing to 
3is the location estimate for each origin. Also, the estimates were worse when the two origins were close 

319 together. 

320 Application 

321 Human Diversity 

322 Wc applied our method to a data set from 55 human populations from the Human Genome Diversity 

323 Panel and HapMap III [4"5"H4"T] . We calculated ip and its standard error for all pairs of populations and 

324 transformed this into a Z-score. As expected from a data set with several hundred thousand loci, the 

325 vast majority of comparisons were highly significant, with a median absolute Z-score of 28.1, and a mean 

326 absolute Z-score of 41.9 across all comparisons made. Globally, we could make out four major clusters 

327 of populations: i) Africans, ii) Europeans and Pakistani, iii) East Asians and iv) Native Americans. 

328 Overall, every one of the 450 comparisons made between a population in Africa and Non- African showed 

329 evidence for gene flow out of Africa, confirming the out-of- Africa hypothesis. Within Africa, we found all 

330 comparisons to be significant, and all pairwise ip values agreeing on a single origin of the expansion. The 

331 San people were the only population that had positive ip values when compared to all other populations, 

332 indicating that they are closest to the origin of humans. They are followed by the Biaka- and Mbuti- 
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333 pygmies, which arc have negative ip values when compared to the San. This is followed by the southern 

334 Bantu sample, and a cluster consisting of Yerubans, Luhya, Mandenka and Northern Bantu, each having 

335 a negative ip for other previously mentioned populations, and positive scores for all other populations. 

336 The African populations furthest away from the origin were the Maasai and Mosabite, the latter being 

337 very distinct from the sub-Saharan populations. 

33s The closest populations outside Africa are the Bedouin and Palestinian populations, both from the 

339 Middle East. The third Middle Eastern population present in our data, however, the Druze people, fall 

340 in a large cluster containing almost all European, Pakistani and Indian populations. Within Europe, the 

341 three Italian population samples all have non-significant ip scores, but are found to be more ancestral 

342 to the other European populations. They are followed by the French and French-Basque, which also 

343 cannot be distinguished, and the Orcadian, Adygei and Russians. In Pakistan, we find the Makrani to 

344 be the most ancestral population, followed by the Brahui and Balochi, Sindhi, Kalash and Burusho. It is 

345 noteworthy that this list corresponds to their distances from Africa, with the exceptions that the Brahui 

346 and Balochi are switched, and the Hazara are not in the main Pakistani cluster, but rather form a distinct 

347 group with the Uygur. Besides the Uygur, all other East Asian populations form a single large cluster with 

348 very little resolution. Clearly distinct from this cluster are the Papuans and Melanesians, which are very 

349 similar with asymmetry between these two populations^ = 0.0019, SEip = 9.2e — 4, Z = —2.05). They 

350 are closer to the Africans than to the East Asians, but further away than the Pakistani and European 

351 populations. 

352 Finally, Native American populations form a distinct cluster, which are strongly separated from all 

353 other populations. Within the Native American populations, we find evidence of a North to South 

354 colonization pattern with Pima being closest to the Eurasian populations, followed by the Maya and 

355 Colombians. The most distant populations are the South American Karitiana and Surui, which have a 

356 nonsignificant pairwisc ip between them. 

357 We also tested our ability to infer the origin of humans using the TDOA approach. As continents 

358 most likely act as strong migration barriers, we did not use the TDOA approach on the entire HGDP 

359 data set. Instead, we applied our method to the data set of Henn et al [55] which contains 30 African 

360 populations. We estimate an origin of the Human expansion at 30° S 13° E, which lies in central South 

361 Africa, closest to the San sample locations at 28.5° S 21° E and 22° S 20° E, respectively. 
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362 Discussion 

363 In this paper, we introduced a new statistic, the directionality index tp. We then showed that tf> can be 

364 used to reject an equilibrium isolation-by-distancc model, and we used it both to characterize a range 

365 expansion and estimate its origin. Although we have focused on range expansions, ip is sensitive to 

366 other deviations from symmetric migration. While a range expansion might be a plausible explanation 

367 in many cases, alternative scenarios such as a source-sink population structure or a large differences in 

368 effective population sizes should also be considered. One of the main advantages of the simplicity of the 

369 directionality index is that the assumptions - and limitations of the approach are easy to discern: the 

370 directionality index is zero if the 2D-SFS is roughly symmetric about the diagonal. This is certainly true 

371 under most models considered in theoretical studies, such as island and stepping stone models, particularly 

372 as the boundary conditions in the latter are typically chosen such that the model is symmetric. The 

373 directionality index can be used to determine how appropriate these models are for a given data set. If 

374 ip differs from zero then care should be taken in applying methods that are based on these theoretical 

375 models. On the other hand, if ip is close to zero, we can interpret this as justification for using the 

376 powerful theoretical results for these models •''>(> . 

377 In this regard, the directionality index can be seen as a "first step" analysis that can be computed very 

378 easily, is able to answer very broad questions about our data and then be used as a guide to which more 

379 complicated parametric models might be employed, e.g. in an Approximate Bayesian Computation [491150] 

380 or dadi |28j framwork. We have also shown how we can introduce the physical location of the samples 

381 in our inference framework. In many cases, natural populations are better described by a continuous 

382 distribution .51(52], and as we show in the TDOA analysis, using a simple statistic together with the 

383 physical locations can result in a quite powerful method. Our approach is also different from most 

384 other methods dealing with spatial data in that it explicitly assumes a non-stationary population. In 

385 this paper, we link the ancestral demographic process of a range expansion to the observed patterns of 

386 genetic diversity. While the effect of the expansion on F$t appears to be quite small, our ip statistic can 

387 be used to distinguish between equilibrium and non-equilibrium models. Finally, we also show how we 

388 can extend our method to deal with complications. Whereas the TDOA analysis is not robust to large 

389 barriers of gene flow, interpreting the pairwise -0 statistics as a graph can unmask important details of a 

390 species' history. 
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391 Simulation Results 

392 We find that the directionality index ip is well suited to distinguishing between isolation-by-distance and 

393 range expansion when demes are sufficiently far apart and the range expansion is recent and occurs at 

394 a fast rate. These restrictions are not surprising. Geographically close demes will be genetically more 

395 similar, regardless of their history, and historical processes should therefore be harder to distinguish. 

396 That a recent expansion is easier to detect than an older one is also easily explained by the eventual 

397 convergence to equilibrium isolation-by-distance pattern, and similarly, a rapid range expansion leaves 
39s less time for genetic drift to blur the patterns created by the range expansion. Lastly, increasing the 
399 amount of data will increase the power to distinguish asymmetric from symmetric processes as each 
mo single SNP contributes only a little information about the history of dispersal. In all cases, our tp 

401 statistic outperforms AH. From the analyses under the stepping-stone model we see one of the main 

402 differences between tp and genetic distance measures, such as Fst- In an isolation- by-distance model, 

403 Fst will increase with distance, but ip will not deviate from zero as the distance between the sampled 

404 locations increases. Again, this makes sense intuitively: The number of shared genetic variants decreases 

405 with distance, and hence Fst increases. However, this reduction in shared polymorphisms is symmetric, 

406 and hence will have no effect on ip. The pattern is different in the model of a population expansion: when 

407 comparing with a sample from the origin of expansion, both Fst and ip increase with distance. The 

408 signal diminishes, when migration rates are high, however. This is apparent from Panel D in Figure [TJ 

409 where ip is zero for the first ten demes. Here, migration had enough time to undo the effect of the range 

410 expansion in the demes that are further away from the origin. 

4n In the origin estimation section, we find that we can get surprising accuracy with relatively little data. 

412 20 samples with around 10,000 SNP yield accurate estimates. This amount of data indicates that our 

413 method is not applicable to mtDNA or microsatellite data, but it should be applicable to transcriptome 

414 data, which can be assembled for many non-model organisms. It is also worth noting that the error does 

415 not go to zero even with larger amounts of data. This can have several reasons. A big contribution 

416 is likely from the linearity assumption we made for the TDOA approach, ip does not increase perfectly 

417 linearly with distance, and especially the boundaries of the simulated region may introduce a considerable 

418 bias. A second, more subtle point is the algorithm we use; whereas least-squares is very easy to use and 

419 yields good results, other optimization algorithms might reduce the RMSE. A third explanation is that 

420 genetic processes are simply very noisy, and we require much more data to obtain better results. Our 
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421 results also show that using heterozygosity to infer the origin of an expansion is largely similar to our 

422 statistic for recent, fast expansions. 

423 We demonstrated how our method can be adapted to incorporate more complex models. We showed 

424 that small differences in deme sizes have little influence on our ability to estimate the origin. If however, 

425 the habitat is very heterogeneous our method becomes less accurate. This implies that some care must 

426 be taken when for example analysing endangered species with very patchy habitats, or species whose 

427 dispersal distance is much lower than the size of a local population, or when analyzing global patterns 
42s of diversity If there are strong barriers to the gene flow, the TDOA method should not been applied, as 

429 its central assumption that ip IS proportional to physical distance is violated. In that case, while it is not 

430 possible to infer an origin that is distinct from the samples, it is nevertheless possible to find the sample 

431 that is closest to the origin, which in many cases might suffice to support or reject a hypothesis. Also, 

432 we have shown that we can apply graph algorithms to get a representation of the migration pattern that 

433 leads to meaningful interpretation. 

434 Human Genetic Diversity 

435 When analysing the human data sets, we found that i) ip scores are correlated with distance and ii) if 

436 population i is closer to Africa than population j, then ip(i,j) is in most cases negative, a pattern that 

437 is expected under a model of expansion from Africa. As explained previously, the directionality index 
43s depends not only on the two population compared but also on the history of the other populations. We 

439 find the South African San people to be the population closest to the origin of humans both using the 

440 TDOA method and when interpreting all pairwise directionality indeces. This supports the interpretation 

441 that the origin of modern humans is somewhere in Southern Africa [5JU5] . Another interesting result is 

442 that the Melanesian and Papuan samples, while very similar, show positive ip values when compared to 

443 other East Asian populations, but the directionality index is negative when compared to the Pakistani, 

444 European and African populations. This is consistent with a "two-wave" model of colonization of South- 

445 East Asia, with a first wave consisting of present-day Papuans and Melanesians, and a second wave 

446 consisting of the present day Chinese populations [53] . 
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447 Methods 

443 Simulations 

449 Wc implemented a simulator that performs continuous time coalescent simulations on a discrete stepping 

450 stone model [HHTO] of finite size. We assumed that the backward migration rates were equal between 

451 all pairs of adjacent demes and that the boundaries were reflecting. We used a modified version of the 

452 expansion model of [35] . where an expansion is modeled with a one-generation bottleneck of reduced 

453 size. In our backward-in-timc framework, this corresponds to moving all lineages present in a deme being 

454 colonized to a randomly chosen neighboring deme. We introduce a founder effect by adding additional 

455 coalescence events according to the appropriate backward Wright-Fisher transition probability (Page 62 

456 in [M])- Unless noted otherwise, all expansions were done with a founder size of 200. Once the final 

457 deme is reached, an regular island model coalescent is run where each island corresponds to a founder 

458 population (in most simulation, the number of islands is one) . 

459 Throughout this paper, we simulated unlinked SNPs using an importance sampling scheme. After 

460 generating 1,000 gene trees, we calculate the appropriate multi-dimensional site frequency spectrum, 

461 where each sampled population corresponds to a dimension. We can then draw SNPs with replacement 

462 from this site frequency spectrum. 

463 The parameters used for the majority of power simulations are as follows: We simulated on a 101 x 

464 101 stepping stone model, with deme coordinates starting at (-50,50) at the lower left corner and (50,50) 

465 in the upper right corner. Each deme exchanges migrants to the neighboring demes to the north, south, 

466 cast and west at rate M = 1. For the power simulation, wc sampled a single diploid individual each from 

467 two colonies at (-25,-25) and (-25,25). For the TDOA simulations we simulated one individual each from 

468 a deme on a quadratic grid between (-30,-30) and (30,30), with 36 samples in total. This corresponds to a 

469 distance of 12 demes between any two sampled demes. We usually generated 1,000 independent coalescent 

470 trees and then used importance sampling to generate 100,000 SNP from the population, conditioning on 

471 them being shared between at least two of the samples. In the case of a range expansion, the standard 

472 point of origin was set to (-15,-25) and the expansion occurred at a rate of one expansion event every 

473 0.001 coalescence units, with the expansion being observed 0.13 coalescent units after it started, where 

474 coalescent units are measured on the time scale of a local deme. These parameters were chosen to roughly 

475 correspond to the human out-of- Africa expansion. The directionality index i\) and F$t were calculated 
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476 in Python; for ijj we used equation ©, and F$t was estimated using Reynold's estimator [T3]. Note that 

477 these are only baseline parameters, and exploring the effect of changing these parameters is the main 
47s purpose of most of our power simulations. 

479 To generate data for the ID stepping stone model analyzed in Figure 1, we simulated a 201 x 1 

480 habitat, with scaled migration rates M = 1,10 between adjacent demes. Sampling was done in domes 

481 — i/2 and i/2, with the center deme having coordinate 0. In case of range expansions, the expansion 

482 started in dome —i/2. 

483 SNP ascertainment may influence our results, because most ascertainment schemes favor high fre- 

484 quency alleles in the populations where the ascertainment was performed. To assess the effect of ascer- 

485 tainent bias on the value of tjj, we performed simulations in an isolation-by-distancc model with samples at 

486 coordinates (0,0), (10,0), (20,0), (30,0), (40,0), (50,0) as well as (0,10) and (15,10) and then computed tp 

487 between the (10,0) and (20,0) sample. We then simulated ascertainment by selecting a set of population, 

488 and rejection sampling SNP so their 1D-SFS followed a Beta(2,4/3) distribution, which roughly matches 

489 the SFS in the HGDP data set and is very different from the expectation without ascertainment bias. 

490 If tp differs significantly from zero, then we know that ascertainment is important. Results are given in 

491 Figure SI; ascertainment is important if it is performed in one of the populations that we calculate ip for. 

492 However, the effect of ascertainment is negligible if the population we calculate t\> for are different from 

493 the ascertainment population, even if the ascertainment population is much more closely related to one 

494 population compared to the other. 

495 Estimating the origin of a range expansion 

496 We use a time-difference of arrival (TDOA) approach [JTj to estimate the origin of a range expansion. 

497 TDOA was originally used in naval navigation during the Second World War, and is currently widely 

498 used to solve localization and navigation problems. It is based on the assumption that a single source 

499 emits a signal that decays with increasing distance from the origin. For range expansions, this signal 

500 can be thought of as the mean allele frequency of shared alleles. At the origin, the allele frequency is 

501 expected to be lowest [35j and to increase approximately linearly with distance. However, since we do 

502 not know the allele frequency at the origin, we have to use the indirect approach by comparing pairs of 

503 populations. To be precise, if we know that shared alleles have a lower frequency at point Si compared to 

504 point Sj, then we know that Si is closer to the origin than Sj. If the habitat is two-dimensional, however, 
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505 this does not tell us the direction of the expansion. Let \\Si, Sj\\ denote the Euclidean distance between 

506 two points Si and Sj. Then, 

0||- 0|| (5) 

507 where O denotes the unknown origin ijiij is the directionality index between samples Si and Sj and v is 

508 a constant that links space to allele frequency (i.e how much does the allele frequency change per unit 

509 of space). In words, ipij is approximately proportional to the difference of the distances |<Si,0|| and 

510 ||Sj,0|| (see also Figure [5]). We assume that the sampling locations of Si and Sj are known without 
5n error, and that ipi j can be estimated from genetic data, along with its sample variance Var(^j). We 
512 estimate the variance by doing 1,000 bootstrap replicates on the SNP. The unknowns that remain are 
so the coordinates of the origin O and the proportionality constant v. To infer these parameters, we solve 
5H for ip, subtract tp from the equation and sum over all pairs of samples: 

(n A \- i f ws^ow-ws^ow \ 

\0,v] = argmax > — — — — - ■ ■ — Wi i \ ■ (6) 

V ' o,v ^Var(^) V « V 

515 In most biological application, space will be two-dimensional and therefore we can make this equation 

516 more explicit by writing O = (x,y) and Si = (xi,yi). Then, 

(x, y, v) = axgrnax^ Var ^, 1 Q (V ( x i ~ x ) 2 + ~ v) 2 ~ \J ( X J - x ) 2 + (Vj - - ^h^j ■ ( 7 ) 

517 The variance terms correspond to weighting terms; terms where ip has a high variance are weighted down, 
5is whereas terms where we can infer i\) with high accuracy are given a larger weight. We can then find a 
519 solution to this equation using nonlinear least squares. 
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Figure 1. Behavior of H (red), ip (black) and F$t (blue) in one-dimensional (A) 
isolation-by-distance and (B) population-expansion models. Simulations were performed on a 
200-stepping stone model with scaled migration rate M=100 between adjacient denies, and expansion 
events every 0.001 coalescence units. F$t increases linearly with distance in both models and ip is zero 
in the isolation-by-distance model, but increases approximately linearly in the expansion model. 
Heterozygosity increases from the center of the population (left) to the border of the habitat (right). 
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Figure 2. Behavior of F$t and ip in isolation-by-distance and population expansion model. 

Each panel gives the value of the pairwise statistics F$t (Panels A and C) and ip (Panels B and D) 
under an isolation-by-distancc model (Panels A and B) and an expansion model (Panels C and D) 
starting in the central deme (50,50). Simulations were performed on a 101 x 101 deme stepping stone 
model, and a diagonal transect from demes at coordinates (0,0) to (100,100) was sampled, and all 
pairwise statistics were calculated. Blue regions correspond to regions where F$t and ip are very low 
(below 1%). The orange and grey regions denote areas with positive and negative ip, respectively. An 
Whereas F$t behaves qualitatively similar under both models, the behavior of ip is very different. 
Under isolation-by-distance, ip is very close to zero, with some deviations due to boundary effects. 
Under an expansion, however, we see a clear signal of an expansion for all demes, except demes that are 
very close to each other, or demes that have the same distance to the origin, but in different directions. 
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Figure 3. True/false positive rates of detecting range expansion Each panel give the 
proportion of replicates in which the null model was rejected at the 5% significance level. Filled dots 
denote simulations under an expansion model, and open dots correspond to an isolation-by-distancc 
model. Black dots corresponds to using the directionality index, and the red dots were generated using 
the difference in heterozygosity as a statistic. The grey dashed line at 0.05 gives the expected 
proportion of false positives under the null hypothesis. Baseline parameters for the simulations were of 
2 chromosomes (one diploid individual) at each location sampled, with locations a distance of 50 each 
other. Each sample consisted of 1,000 independent SNPs shared between the two populations. 
Significance was assessed using a binomial test. 
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Figure 4. Illustration of the method used to infer the origin of a range expansion. The 

black and grey points correspond to genetic samples taken, the white point corresponds to the 
(unknown) origin of the expansion. Using the directionality index ip, we can infer the difference in 
distance from the samples to the origin (orange line). The set of all points that has the same difference 
in distance to the origin corresponds to the arm of a hyperbola (yellow), with some candidate points 
outlined with the dashed point. Using a second pair of points (the grey and top black point), we can 
identify a second hyperbola, and find an unique location of the origin. In practice, we use more than 
three sampling locations. Sampling noise will cause the hyperbolas to not intersect in a single point. 
We use a least-squares criterion to estimate the location of the origin. 
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Figure 5. Detecting the origin of a range expansion. Each panel corresponds to a 101 x 101 grid 
of populations that were simulated. The expansion began at point (25,35) (indicated by gray dotted 
lines). Black bordered circles indicate sampling locations, black arrows correspond to tp > 1% between 
adjacent samples, with the direction of the arrow indicating the sign of -0. Thicker arrows correspond to 
larger -0. The red ellipse corresponds to the 95% confidence interval of the estimated location of the 
origin. Panel a: no expansion (isolation- by-distance model). Edge effects cause the estimated origin to 
be close to the center of the grid of populations. Panels b-d: Expansion with parameters M = 1, t = 0.1 
and samples taken every 10th, 20th and 50th deme. While the confidence region is larger for smaller 
numbers of samples, we get a very accurate result even when we have only 9 samples. 
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Figure 6. Performance of TDOA method. We present the root mean squared errors of our TDOA 
method (black) compare it with the method of Ramachandran et al. 2005 (red). Samples taken on a 
grid ware represented by full lines, whereas dashed lines denote samples that were taken from random 
coordinates in the simulated region. Our method is superior when the expansion occured slowly or 
when it finished some time in the past; but the method perform very similar for recent, fast expansions. 
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Figure 7. Identifying complex patterns of migration. We simulated data on a S-shaped habitat 
with two impermeable barriers (Panel A) The darkness of the shading is proportional to the arrival 
time of the expansion, which began in deme (20,20). Black circles correspond to locations sampled. In 
Panel B we show the inferred pairwise directionality, with all edges remaining after thinning the graph 
shown in grey, and a maximum spanning tree in red. We also show the inferred ordering of the samples 
as a color gradient of the samples from light (closest to origin) to dark. The barriers can be identified 
from panel B by the absence of any indication of gene flow across the barriers and by examining the 
ordering of the samples. 
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Figure 8. Detecting multiple origins. Panel a: Wc simulated two expansions that originated at the 
same time from origins indicated by the blue crosses. The color gradient in the background corresponds 
to the time of colonization time of each deme. We address the problem of inferring the origin of multiple 
expansions using a two-step procedure. First, we cluster the samples into discrete clusters (red and 
black, respectively) and then estimate the expansion signal and origins independently for the clusters, 
resulting in high accuracy for both origins (green X). The right panel shows the inferred migration 
patterns after a transitive reduction (grey /red arrows) and a maximum spanning tree (red arrows). 
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Figure 9. Inference of human migration routes. The figure shows a visual representation of the 
pairwisc directionality indices between human populations in HGDP and HapMap. Each line 
corresponds to the pairwise ip statistic, with thicker and brighter lines corresponding to higher values. 
Grey and red lines denote eastward and westward migration, respectively. Lines with an absolute 
Z-scorc below 5 were omitted. 
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Figure 10. Effect of ascertainment bias on ip. We show the effect of strong ascertainment bias in 
different domes given on the x-axis on ip calculated between samples taken from coordinates (0,10) and 
(0,20). N = no ascertainment. Ascertainment has very little effect if it is performed in a deme that is 
not used in the comparison. 



